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^^bstract 

Qn, this paper, we provide a polynomial time algorithm to 
calculate the probability of a ranked gene tree topology 
j^r a given species tree, where a ranked tree topology is 
^L tree topology with the internal vertices being ordered. 
^The probability of a gene tree topology can thus be cal- 
culated in polynomial time if the number of orderings of 
'""the internal vertices is a polynomial number. However, 
brrhe complexity of calculating the probability of a gene 
^^ree topology with an exponential number of rankings 
. ^dpr a given species tree remains unknown. 






Introduction 



<Phylogenetic reconstruction methods aim to infer the 
^species phylogeny which gave rise to a group of extant 
^pecies. Typically, this species phylogeny is obtained 
Ceased on genetic data from representative individuals of 
C^sich extant species. The ancestries of genes at differ- 
^eiit loci form gene trees which do not necessarily have 
^he same topology as the species tree. Gene tree topolo- 
gies and species tree topologies might be different due to 
• '^uch phenomena as incomplete lineage sorting, gene du- 
rS>lication, recombination within gene loci, and horizontal 
C^ene transfer ^. In this paper, we focus on incomplete 
lineage sorting as the mechanism for incongruence of 
gene tree and species tree topologies, in which two gene 
lineages do not coalesce in the most recent population 
ancestral to the individuals from which the genes were 
sampled. As an example, the lineages sampled from 
species A and B in Figure lb do not coalesce until the 
population ancestral to species A, B, and C, thus allow- 
ing the B and C lineages in the gene tree to have a more 
recent common ancestor than lineages A and B. 

Given a fixed species tree, and assuming the gene tree 
evolved under the multi-species coalescent [1], the most 
probable gene tree topology can have a diff'erent topol- 
ogy from that of the species tree. Such a gene tree topol- 



ogy is called an anomalous gene tree. In fact, for ev- 
ery species tree topology with at least 5 leaves, we can 
choose edge lengths in the species tree topology such 
that anomalous gene trees exist [3]. This implies that 
the gene tree topology appearing most often when con- 
sidering different genes might not agree with the species 
tree topology, thus we cannot use a simple majority- 
heuristic to infer the species tree from a collection of gene 
trees. Instead we need statistical tools rather than ma- 
jority rule heuristics for inferring the species tree based 
on gene trees. 

Current methods for inferring species trees from gene 
trees in this setting can be divided into topology-based 
and genealogy-based methods, in which the input for a 
reconstruction algorithm accepts either gene tree topolo- 
gies or genealogies, i.e., gene trees with branch lengths 
(coalescence times). Topology-based methods include 
Minimize Deep Coalescence (MDC) [IS [28], STAR [H], 
STELLS [33], rooted triple consensus [9] and other con- 
sensus and supertree methods [2l [32] . Genealogy-based 
methods include Bayesian and likelihood methods such 
as BEST, *BEAST, and STEM [IIl[l3l[Il] and cluster- 
ing and distance-based methods ^5\ [ITJ [THJ [20] . Possible 
pros and cons of the two approaches are that topology- 
based methods can be computationally faster and less 
sensitive to errors in estimating gene trees (and gene tree 
branch lengths) from sequence data [12], while methods 
that use coalescence times, particularly using Bayesian 
modelling, can be the most accurate when model as- 
sumptions are correct |16j . 

Another possibility that has been so far unexplored in 
methods for inferring species trees from gene trees is to 
use ranked gene trees, in which the temporal order of the 
nodes of the gene tree (the coalescence times) is used, 
but not the continuous-valued branch lengths. This ap- 
proach might therefore be intermediate between purely 
topology-based methods and genealogy-based methods. 
By preserving more of the temporal information in the 
gene tree nodes, the hope is to develop methods that 
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Figure 1: In (a)-(d) the ranked species tree topology is {{{A, B)^, 0)2, {D, -E)3)i. (a) The ranked gene tree matches 
the ranked species tree, (b) The (ranked or unranked) gene tree does not match the species tree, and there is an 
incomplete hneage sorting event (a deep coalescence) because the lineages from species A and B fail to coalesce 
more recently than S2- (c) The gene tree and species tree have the same unranked topology but have different 
ranked topologies, as D and E coalesce in the gene tree more recently than A and B, while A and B is the most 
recent divergence in the species tree. The gene tree in (c) has ranked topology {{{A, B)3,0)2,{D,E)4)i. In (c), 
there are no incomplete lineage sorting events (no deep coalescences); however, there is an extra lineage at time 
S3 which leads to the gene tree and species tree having different rankings. In (c), all coalescences occur in the 
most recent possible interval consistent with the ranked gene tree, and we have ii = 2,^2 = 3,^3 = 5,^4 = 5, and 
gi = 2, g2 = 3, gs = 5, g^ = 5. (d) A gene tree with the same ranked topology as the gene tree in (c) but with 
coalescences occurring in different intervals. 



are more powerful than purely topology-based methods 
and that are still computationally efficient and robust 
to errors in estimating gene trees and gene tree branch 
lengths from sequence data. 

In [5] , a first step toward developing methods that use 
ranked gene trees for inferring species trees was taken 
by providing formulae to calculate the probability of 
a ranked gene tree given a species tree. The previous 
work, however, was based on an exponential enumera- 
tion of what were called ranked coalescent histories and 



did not provide an algorithm for computing some of the 
key terms in the probability of individual ranked histo- 
ries. In this paper, we improve this previous (compu- 
tationally inefficient) approach, by providing a method 
for computing probabilities of ranked gene trees given 
species trees which is polynomial in the number of leaves 
using a dynamic programming approach. 

Methods for computing probabilities of ranked gene 
trees efficiently may also be of interest in the context of 
computing probabilities of unranked gene trees, partic- 



Table 1: Notation used in the paper 



Symbol meaning 



r 

n 

Si 
Ti 

mi 
9i 

yi,z 

Ui 



h 



■«j,2 



5{y),5{u) 
lca(w) 

A,, 



*«j 



K 



species tree with real-valued divergence times 

ranked gene tree (real-valued coalescence times not specified) 

the number of leaves of T and ^ 

speciation times, with si > • • • > s.„_i, let sq = co 

intervals between speciation times, tj = [sj, Si-i) 

the number of gene tree lineages at time Si 

the number of coalescence events in interval tj 

the ranked gene tree observed from time to time Sj 

the minimum number of gene tree lineages at time Sj 

population z in interval Tj in beaded tree 

internal node (coalescence) with rank i in the gene tree, ui is most ancient, 

Un-i is the most recent 

the number of lineages available for coalescence in population yi^z just after the jth coalescence 

(considered forward in time) in interval tj; A;j^o,z is the number of lineages "exiting" at time Sj_i 

the set of leaves descended from a node of the species tree or gene tree, respectively 

for a node u of the gene tree, the node y of the species tree with largest rank such that 5{u) C 5{y) 

for a node y with rank i on the species tree, we denote T(y) = Tj (the interval immediately above y) 

the overall coalescence rate in interval Tj immediately preceding (backwards in time) 

the jth coalescence 

number of sequences of coalescences above the root of the species tree starting with k lineages 

the joint density of coalescence times in interval tj 



ularly because no polynomial time algorithm has been 
found for calculating the probability of a gene tree topol- 
ogy given a species tree under the multispecies coales- 
cent [HI [231 [ini [33] • The probability of an unranked 
gene tree topology can be obtained by summing over 
all ranked gene tree topologies with the same topology. 
Thus, for unranked gene trees with particular shapes 
where the number of rankings increases in polynomial 
time, using ranked gene trees can potentially increase 
the speed of computing probabilities of unranked gene 
trees as well. We note that a completely unbalanced 
gene tree has only one ranking, while the number of 
rankings can be exponential in the number of leaves 
when gene trees become more balanced. Thus, our ap- 
proach for calculating unranked gene tree probabilities 
will be most useful for less balanced ranked gene trees. 



The bulk of the paper consists of the derivation of the 
polynomial time method for computing ranked gene tree 
probabilities. The algorithm is summarized in section 
2.21 This is followed by a discussion of applications to 
computing probabilities of unranked gene tree topologies 
and to inferring ranked species trees under maximum 
likelihood and a modification to the MDC criterion. 



2 Calculating the probability of a 
ranked gene tree topology 

In the following, we will derive the probability of a 
ranked gene tree topology given a species tree, P[^ | T]. 
Equations (I1I2I3I4I8I1UI) allow the calculation of P[^ | T] 
in time 0{rv'). The model giving rise to the gene tree 
is the multi-species coalescent with constant population 
sizes [1]. Each species consists of a population of con- 
stant size where lineages merge according to the coa- 
lescent. Thus, lineages from two different species may 
coalesce any time previous to the split of the two species. 

We begin with soine notation, which is also summa- 
rized in Table 1. Let time be today and increasing 
going into the past. Let T be a species tree with n 
species, and thus n — 1 speciation events (denoted by 
1, . . . , n — 1) occurring at times si > • • • > Sn-i- Denote 
the interval between speciation event i — 1 and speciation 
event i by Tj, see Figure 1. 

Let ^ be a ranked gene tree topology. It is convenient 
to use the same labels for the leaves of ^ and of T. This 
is a slight abuse of notation, as leaf A oi T refers to 
a population (or species), and ^4 of ^ refers to a gene 
sampled from population A. We denote the nodes of 
^ (which are coalescence events) by ui, . . . , Un-i, where 
node Uj has rank j, and where higher rank indicates a 
more recent coalescence. A ranked tree topology can be 



notated similarly to Newick notation, putting the rank 
as a subscript for each node, see also Figure 1. 

Let §fj^£. be part of a ranked gene tree evolving on a 
species tree between time Si and time (i.e. the present). 
^i^£. consists of £{ gene tree lineages at speciation time 
Si and the coalescent history of "^i/^ in time interval 
(0,Si) is consistent with the ranked gene tree ^. Let 
gi be the minimum number of lineages required in the 
ranked gene tree at time Si such that 1# can be embedded 
into the species tree 7~. Note that n > ii > gi > i. 
Next we provide a dynamic programming approach for 
calculating the probability of a ranked gene tree given a 
species tree. An efficient way to determine the required 
quantities gi, . . . , gn-i is provided in Section [TH 

Essentially, in our approach, we traverse the in- 
tervals between speciation events going back in time, 
rn_i,...,r2 (formalized in Theorem [2]), and calculate 
the probability of the appropriate coalescent events oc- 
curing in interval Tj based on how many coalescent 
events happened in the later intervals tj+i, . . . , t„_i 
(Theorem [3]) . Finally with Theorem [H we account for 
the most ancetral time interval ri. 

Theorem 1. The probability of a ranked gene tree given 
a species tree is, 
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^1=51 



i/i I 'T]/Hii 



where 



He, =h\ih- 1)1/2'^-' 



(1) 



(2) 



is the probability for the coalescences above the root ap- 
pearing in the right order J^. 

For precalculated P[^i^£j7^ (£i = 2, . . . ,n) the com- 
plexity of calculating F['^\T] is thus 0{n). Next, we 
will provide a recursive way to calculate P[^i^fj7^ for 
£i = 2, . . . , n in polynomial time, thus P[^ | T] can be 
calculated in polynomial time. 

Theorem 2. The probability F[^i^i^\T] can be calculated 
for all i recursively (with li > gi), 



n 
^i+l=max(^i,gi+l) 



(3) 



'n-l,n I 71 = L 



with 



The complexity of calculating P[^i ^^ | T] for ii = 
2, ... ,n is 0{n^), given we know P[^j/j | ^i+i/^^-, , T] for 
alli,ii,ii+i. 

Proof. At the time of the most recent speciation event, 
Sn-i, we have n lineages with probability 1, which is the 



initial value of the recursion. Calculating P[^i^£. |71 for 
i < n — 1 can be done in the following way, 

n 

Y, P[^,,,„^,+iA^jT] 

£i+i=max{ei,gi+i) 



E 



i.Wn-u.^.Tjfm+u.jn 



£i+l=max(£i,gi+i) 



Suppose P[^j^^-|^j_|_i^£._|_^,7l is known. Given we calcu- 
lated the probability P[l#j+i^£-^j \T] for ii^i = i+2, . . . , n, 
then calculating F[^i^i.\T] for £« = i -|- 1, ... ,n re- 
quires 0{J2^ZiJ) = 0(("~2^ )) calculations. Sum- 
ming up over i = 1, ... ,n — 1 yields a complexity of 

o(Er=2(2)) = o(rr)) = o(n3). n 

It remains to determine P[^j_i^£._ ^ {^i/^ , T] . Note that 
during the interval Tj, we have i branches in the species 
tree. Let m, be the number of coalescent events in tj, 
so rui = ii — ii-i. Let the number of lineages on branch 
z just after the j'th coalescent event (going forward in 
time) in tj be kij^^- Calculation of fcj,j,z can be done 
efficiently as shown in Section [2.11 

Theorem 3. We have, 

where Ajj = X^^^i ( ''f) and (2) := 0. 

Proof. The density for the coalescence events in interval 
Ti can be obtained by considering the waiting time to 
the "next" coalescent event (going backwards in time) 
as being due to competing exponentials in the different 
branches, where the coalescence rate within branch z is 
( '^''') . Thus, the waiting time until the next coalescent 

event has rate Ajj = X]l=i ( '2'^)- 

We denote the time between the jth and (j + l)st 
coalescent event as Vj, where vq is the time between 
Si-i and the first (least recent) coalescent event in tj 
and with Vrm being the time between Sj and coalescent 
event rui. 

The density for the coalescent events in the interval 
Ti is [5], 



fiivO,Vl,...,Vni,) 






It remains to integrate over v, for which we distinguish 
between case (i) Aj^o = 0, and case (ii) Aj^o > 0. 

Case (i): If Aj^o = (which occurs if ii-i = i, i.e., 
all lineages within each population coalesce), then we 
rewrite fi as. 



fi{VO,Vl,... ,VmJ = -^ 



nr=i ^ 



(5) 



■i=i ''»J 



Using the fact that the integral of the numerator of 
Equation ([5]) is a hypoexponential distribution based on 
the sum of m,j exponential random variables [23] (with 
density functions Xije~'^^'^'"^ , j = 1, . . . , rui), the proba- 
bility of the coalescent events in the interval is the cumu- 
lative distribution function of the hypoexponential dis- 
tribution evaluated at Sj_i — Sj = X^"!!o^j- Thus, with 



-u.-Ma-T] 



E 



e '^'-^ 



\,j {Si—l—Si) 



I ^ g-A,,j(s,_i-Si) 

™' g-A,,j(sj_i-Si) 



determine P[^j_i^^-„-^|^i^^^, 7^ for all possible i, rrii and 
ii. First, we observe that i < ti-i < n, and thus for a 
fixed ii, we have, < nii < £i — i. Second, i < ti < n. 
And third, 2 < i < n — 1. Thus, the number of cal- 
culations needed to calculate P[^j_i^£._j|^j^f^,7^ for all 
possible i, rrii and ii is, 

(n— 1 n t-i—i \ I n—1 n 

E E E-n=o E E(^-^) 
i=2 ^i=i+l mi=0 y \i=2 ii=i+l 

= O (^{n - ^A 
5^ 



O (n^) . 



D 






ni 



A^i,k ~ Xij) 



(6) 



where the second line follows because —Xij = Xifl — K,j- 
Case (ii): If Xifl > 0, then we rewrite fi as, 

fi{vo,Vi,...,Vm,) = fVnr-\ (^) 

1 [j=0 ^i,j 

For integrating /j, we use the fact that the integral 
of the numerator in Equation d?]) is the convolution of 
rui + 1 exponential random variables with parameters 
Aj.o, • • • , Aj^mi) which is the hypoexponential distribu- 
tion. Now, since Ajj- < Aj^+i, we observe, using the 
probability density function of the hypoexponential dis- 
tribution, 

= / fi{vO,Vl,...,Vmi) dv 

J V 

Z^ g-Aij(si_i-Si) 

j^Q \.\.k=0,k^jy^i,k — Xij) 

which is the same expression as for the Aj^o = case ([6]). 
Note that for case (i) we made use of the cumulative dis- 
tribution function of the hypoexponential distribution, 
while for case (ii) we made use of the density function 
of the hypoexponential distribution. Both cases yield 
the same final expression for IP[?^j-i/j_i|^i,4,7^, which 
establishes the proof. D 

Corollary 4. The probabilities P[§^j_i,^^_J^i,f,,7l for 
all possible i, mi and ii (recall that nii = ii — ii-i) are 
calculated in 0{n^), given all Xij. 

Proof. For a fixed i, rrii and ii, we require 0{mf) cal- 
culations to evaluate P[^i_i^f-_J^j^£. , 7^. We need to 



Corollary 5. The quantities Xij can be calculated for 
all possible i, rui, ii and j in 0{n^), given all kij,z- 

Proof. For a fixed i, nrii, ii and j, we require 0{i) calcu- 
lations to evaluate Ajj-. As j = 0, . . . , rrij, with the same 
arguments as in Corollary HI we obtain. 
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E^ E(^-i)' 
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i[n — I] 



,i=2 



O (n^) . 
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We note that the terms P[^j_i^£._J?fj^£.,7^ are anal- 
ogous to the functions gij defined in [271 El]) which 
give the probability that i lineages coalesce into j within 
time t in a single population and are used extensively in 
computing probabilities related to unranked gene trees 
[SI [211 [22l [33] . In particular, if only one population, say 
z* , has coalescence events, then we have 

P[^i-iA_J^iA,7l 

_ 9£,+iAiSi - Sj+i) Yl^-tz' gfc,,o,,,fc,,o,,(s^ - S^+l) 
~ T-r4+i"^i rii+i-k+l\ ' 

a product of gij functions with the denominator count- 
ing the number of sequences in which rrii coalescences 
could have occurred. The terms F[^i-i^i._-^\^i^i.,T] al- 
low for the coalescences to occur in separate popula- 
tions, however, and are constrained by the ranking of 
the gene tree. For example, in interval T3 of Figure 
Ic, there are two coalescences which occur in differ- 
ent populations. If the ranking of the gene tree were 



not important, the branches could be considered in- 
dependent, and the probabihty of this event would be 
5'2,i(s2 — 83)92,1(82 — S3). However, the gene tree ranking 
constrains the coalescence of A and B to be less recent 
than that of D and E, so the probability for events in 
this interval is, 

F[^3,2|^4,3,71 = [52,1(S2-S3)]V2. 

We illustrate that we get the same result from Theorem 
131 there are two coalescence events in interval T3, so we 
use J = 0, 1, 2, and calculate 
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0, 

1, 
2. 



Thus, Equation @ from Theorem [3] evaluates to 



-0{S2-S3) 



3-l(s2-S3) 



-2(S2-S3) 



+ 



(2-0)(l-0) (0-l)(2-l) (0-2)(l-2) 

= i _ g-(s2-S3) , Ig-2{S2-S3) 

2 2 

''1 _ g-(s2-S3)^ 



n! ^ \/2^(n/e)". 



and the denominator can be approximated by, 

n—l k 



ll{c^-l)=ll{2^-l)n/2' 



n = n 



1=1 



i=l 



showing that the ratio grows faster than polynomial in 
n. 

2.1 Calculation oi gi and fcjj^ 

Calculation of gi 

If T and 5f have the same ranked topology, then gi = 
i + 1. In general, to compute gi, we let lca(t(j) be the 



yi,i 




Figure 2: The beaded version of the species tree topol- 
ogy in Figure la-d. 



least common ancestor node on the species tree for a 
node Uj on the ranked gene tree - i.e., the node with 
the largest rank on the species tree which is ancestral to 
all species represented in Uj. For a node y on the species 
tree, let r(y) be the interval immediately above y. For 
example, in Figure Ic, T(lca(n4)) = T3 where u^ is the 
gene tree node with rank 4 — the node ancestral to D 
and E only. We then express gi as 



n— 1 n—l 



n 



^ n ^(-^(IcaK)) > rO 



(8) 



b2,i(s2-S3)]V2. 



Remark 6. The probability of a gene tree topology is 
the sum of the probabilities of each ranked gene tree 
with the given topology. A given tree topology has 
(n — l)!/nr=i i^i ~ ^) fOinkings, where Ci is the num- 
ber of descendant leaves of interior vertex i. A proof 
can be found in 126^ . For a completely balanced tree on 
n = 2 leaves, the number of rankings grows faster than 
polynomial: the numerator can be approximated by. 



j=i+l k=j 



where Tj < ti iff j < i, and where /(•) is an indicator 
function taking the value 1 if the condition holds and 
otherwise 0. Assuming each lca() operation is 0(1) [lOl 
[25]). preprocessing allows all lea terms to be computed in 
0{n) time. Similarly all needed products and the sum in 
Equation ([8]) can each then be computed in 0{n) time. 
Thus, calculating gi,. . . , gn-i can be done in 0{n) time. 

Calculation of fcj j ^ 

We let yi^j be the jth population (read left to right) 
in interval Tj (equivalently, the jth branch or jth node 
subtending the branch). In order to label every popula- 
tion before and after a speciation time Sj uniquely, extra 
nodes can be added to the species tree to form a beaded 
species tree (Figure 2), so that there are i nodes at time 
Sj, i = 1, . . . , n — 1. For each i G {1, . . . , n — 1}, there 
is one node of outdegree 2, and i — 1 nodes of outdegree 
1. Thus, population yjj corresponds to a branch (equiv- 
alently, a node) in the beaded species tree. We denote 
the outdegree of a node y by outdeg{y). 

In the remainder of this section, we compute the val- 
ues ki^j^zi i.e. the number of lineages on branch yi^^ of 
the beaded species tree during the interval immediately 
after the jth coalescence event (going forward in time). 



with fcj,o,z being the number of hneages "exiting" the 
branch at time Sj_i. For example, in Figure lb, we have 



^2,0,1 
^2,0,2 



1) ^'2,1,1 
1, ^2,1,2 



^2,2,1 
^2,2,2 



The value of hj^z depends on the number of lineages 
entering branch i, ii, as well as the number of lineages 
exiting the branch, and not just on the number of coales- 
cence events in the interval. For example, in Figure Ic, 
^2,0,1 = 1 and A:2,i,i = 2, while in Figure Id, A:2,o,i = 2 
and A;2,i,i = 3, although the two gene trees have the 
same ranked topology and m2 = 1 for both cases. 

To determine the terms kij^z we note that the number 
of coalescences that have occurred more recently than 
interval tj is n — ii. In a given interval Tj, we let z^^' 
and z^"^' be the left and right children, respectively, of 
population z of outdegree 2, and let z'^^^ = z^"^' be the 
only child of a node z of outdegree 1. 

The number of lineages available to coalesce in popu- 
lation z of interval r,- is 



outdeg{yi^^) 
Ki,mi,z = 2_^ ^1+1,0,^0) 



(9) 



where the z"-* are the daughter populations (one or two) 
of z. Further, knfl^z = for all z. Since the beaded 
species tree has n^/2 nodes, precalculating outdeg{yi^z) 
requires O(n^). For < j < rrij, we have 

J fejj+i^z — 1 jth coalescence on branch z 
ykijj^i^z otherwise 

(10) 

Consequently, determining a particular /cjj^^ is 0(1). 
Thus determining ki^j^z for all possible i, mj and ii is 
(see also Corollary H]) , 

(n— 1 n ii—i rrii 
E E E Eom 
1=2 ei=i+lmi=Oj=0 

= O {n') . 

Note that taking the sum over all z is not necessary, as 
in all but one branch the kij^z equals the fcjj+i^^. 

2.2 An algorithm 

In summary, we derived an algorithm with runtime 
0{n^) for calculating the probability of a ranked gene 
tree given a species tree on n tips: 



1. Calculate gi, ■ ■ ■ Qn-i using Equation ([8] 

,n;z = l. 



2. Calculate kij^z (for i,j = l 
Equations © and ([lUD. 



,...,.(.,. 



i), usmg 



3. Calculate Xij = Yl^z=i ( '2'^) (^^^ *'J = Ij • • • ,n). 

4. Calculate P[S^i_i,^,_J^i/,,7l (for i = 2,...,n; 
ii-i = gi-i, ... ,n; ii = gi,. . . ,n), using Theorem 

El 

5. Calculate F[^ij^\T] using Theorem [2l 

6. Calculate P[§^ | T] using Theorem [TJ 

3 Discussion 

In this paper, we provide a polynomial-time algorithm 
(O(n^) where n is the number of species) to calculate 
the probability of a ranked gene tree topology given a 
species tree, summarized in Section 12.21 We now dis- 
cuss applying these results to computing probabilities 
of unranked gene tree topologies and to inferring ranked 
species trees. 

3.1 Computing probabilities of unranked 
gene tree topologies 

Previous work on computing probabilities of unranked 
gene tree topologies used the concept of coalescent his- 
tories, which specify the branches in the species tree in 
which each node of the gene tree occurs. An unranked 
gene tree probability can then be computed by enumer- 
ating all coalescent histories and computing the proba- 
bility of each. The number of coalescent histories grows 
at least exponentially when the (unranked) gene tree 
matches the species tree, making this approach compu- 
tationally intensive. Coalescent histories can be enumer- 
ated either recursively (e.g., in PHYLONET [30] or [23]) 
or nonrecursively (COAL [6]). 

A much faster approach using dynamic programming 
similar to that used in this paper is implemented in 
STELLS [33], which conditions on the ancestral con- 
figuration in each branch rather than the number of 
lineages. Here an ancestral configuration keeps track 
not only of the number of lineages in a branch in the 
species tree, but also the particular nodes of the gene 
tree. Different ancestral configurations can potentially 
have the same number of lineages within a population. 
Enumerating ancestral configurations turns out to have 
exponential running time for arbitrarily shaped trees, 
but the number of ancestral configurations is still much 
smaller than the number of coalescent histories. When 
computing probabilities of ranked gene tree topologies, 
however, the ranking specifies the sequence of coales- 
cence events, leading to a unique ancestral configura- 
tion given the number of lineages in a time interval. 
This fortuitously enables probabilities of ranked gene 
tree topologies to be computed in polynomial time. 

We note that although the number of rankings for a 
gene tree is not polynomial in the number of leaves in 



general, the number of rankings can be small for cer- 
tain tree shapes. For example, if the gene tree has a 
caterpillar shape, in which each internal node has a leaf 
as a descendant, then there is only one ranking, and 
thus computing the ranked and unranked gene tree are 
equivalent. For a pseudo-caterpillar, a tree made by re- 
placing the subtree with four leaves of a caterpillar with 
a balanced tree on four leaves [23], there are only two 
rankings possible, and for a hicaterpillar [23] . for which 
the left subtree is a caterpillar with n^ leaves and the 
right subtree is a caterpillar with n — ni leaves, there are 
[Jil'~_i) rankings. Thus computing unranked gene tree 
probabilities by summing ranked gene tree probabilities 
can be done in polynomial time for some tree shapes. 
We note that for the approach used by STELLS, some 
tree shapes can also be computed in polynomial time, in- 
cluding the cases we mentioned that have a polynomial 
number of rankings. An open question is whether there 
are any classes of unranked gene trees which have a poly- 
nomial number of rankings but an exponential number 
of ancestral configurations, or vice versa. 

3.2 Inferring species trees from ranked gene 
trees 

Our fast calculation of the probability of ranked gene 
tree topologies can be used to determine the maximum 
likelihood species tree from a collection of known gene 
trees. Assume we have observed N ranked gene trees 
(i.e., N loci). Now the maximum likelihood species tree 
Tml (with branch lengths on internal branches) is 
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where 



P[^i,...,^^|7l 



TV 

fc=i 






(i) 



IT]"' (11) 



is a multinomial likelihood. Here P[?^fc|7^ can be deter- 
mined with our polynomial-time algorithm, we let ^'*^ 
denote the iih. ranked topology, and ni is the number of 
times ranked topology i is observed, with X]j='i ^i — ^ ■ 
Note in particular that the ranked topology of Tml 
might differ from the most frequent ranked gene tree 
topology^. 

Our derivation of the ranked gene tree probability also 
suggests a way to infer a ranked species tree topology 
from ranked gene tree topologies with a similar flavor as 
the MDC criterion. In MDC, for an input gene tree and 
candidate species tree, the number of extra lineages (lin- 
eages which necessarily fail to coalesce due to topological 
differences between gene and species trees) on each edge 
of the species tree is counted. For MDC, whether the 
edge of the species tree is long or short does not affect 



the deep coalescence cost. In working with ranked gene 
trees, however, we can keep track of the minimum num- 
ber of extra lineages within each time interval tj. The 
total number of extra lineages in this sense is 



i=l 



9i 



(^ + 1) 



(12) 



Minimizing (I12p as a criterion for the ranked species 
tree will tend to penalize long edges of the species tree 
which have multiple lineages persisting through multiple 
species divergence events. As an example, in Figure lb, 
the gene tree has a MDC cost of 1 since there are two 
lineages exiting the population immediately ancestral to 
A and B; however the cost according (I12p is 2 because 
there are two edges on the beaded version of the species 
tree (Figure 2) that each have an extra lineage. In Fig- 
ure Ic, the gene tree has a MDC cost of for the species 
tree since it has the matching unranked topology; how- 
ever, the number of extra lineages from equation ()12p 
is 1. We note that in Figure Ic, interval T3, incom- 
plete lineage sorting (and deep coalescence) have not 
occurred as these concepts are normally used. To cap- 
ture the idea that coalescence has nevertheless occurred 
in a more ancient time interval than allowed, we might 
refer to the coalescence of A and B in Figure Ic as an 
"ancient lineage sorting" event (rather than incomplete 
lineage sorting event) or an ancient coalescence rather 
than a deep coalescence. We could therefore refer to 
minimizing equation (I12p as the Minimize Ancient Co- 
alescence (MAC) criterion, which would provide an in- 
teresting comparison to the usual topology-based MDC 
criterion. 

In practice, a method of inferring a species tree from 
ranked gene trees would require estimating the ranked 
gene trees. This would require clock-like gene trees, or 
trees with times estimated for nodes, which can also 
be inferred under relaxed clock models in BEAST [7]. 
To account for the uncertainty in the gene trees, the 
counts for different ranked gene trees could be weighted 
by their posterior probabilities obtained from Bayesian 
estimation of the gene trees [1]. Thus, in equation (fTT]l . 
we would let rij^ be the posterior probability of ranked 
topology i at locus k, and use rii = J2k=i ''^ik ^^ the 
estimated number of times that ranked topology i was 
observed. Similarly, for equation p^ . the coalescence 
cost at a locus could be distributed over multiple topolo- 
gies weighted by their posterior probabilities. 
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